Improving G2P from Wiktionary and other (web) resources
Author
Abstract
We consider the problem of integrating supplemental information strings in the grapheme-to-phoneme (G2P) conversion task. In particular, we investigate whether we can improve the performance of a G2P system by making it aware of corresponding transductions of an external knowledge source, such as transcriptions in other dialects or languages, transcriptions provided by other datasets, or transcriptions obtained from crowd-sourced knowledge bases such as Wiktionary. Our main methodological paradigm is that of multiple monotone many-to-many alignments of input strings, supplemental information strings, and desired transcriptions. Subsequently, we apply a discriminative sequential transducer to the multiply aligned data, using subsequences of the supplemental information strings as additional features.
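To make the paradigm concrete, the sketch below shows one plausible way, not necessarily the authors' implementation, to turn a monotone alignment of grapheme segments and a supplemental transcription into features for a discriminative sequential model. The function name, the feature templates, and the toy alignment are illustrative assumptions.

def position_features(graphemes, supplement, i, window=2):
    """Features for position i of a monotonically aligned word.

    graphemes  -- list of grapheme segments, one per aligned position
    supplement -- list of supplemental segments (e.g. a Wiktionary or
                  other-dialect transcription) aligned to the same positions
    """
    feats = {}
    # ordinary grapheme context features around position i
    for off in range(-window, window + 1):
        j = i + off
        g = graphemes[j] if 0 <= j < len(graphemes) else "#"
        feats[f"g[{off}]={g}"] = 1.0
    # supplemental-string features: the aligned segment and its short subsequences
    s = supplement[i]
    feats[f"s={s}"] = 1.0
    for n in (1, 2):
        for k in range(len(s) - n + 1):
            feats[f"s_sub={s[k:k + n]}"] = 1.0
    # conjunction of the grapheme segment and the supplemental segment
    feats[f"g&s={graphemes[i]}&{s}"] = 1.0
    return feats

# toy example: English "phone" aligned one-to-one with a supplemental transcription
word = ["ph", "o", "n", "e"]
supp = ["f", "oʊ", "n", "_"]   # "_" marks an empty aligned supplemental segment
features = [position_features(word, supp, i) for i in range(len(word))]

Feature dictionaries of this form could then be consumed by any discriminative sequence labeler (for instance a CRF-style model) that predicts the target phoneme segment at each aligned position.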
Similar resources
Using Wiktionary for Computing Semantic Relatedness
We introduce Wiktionary as an emerging lexical semantic resource that can be used as a substitute for expert-made resources in AI applications. We evaluate Wiktionary on the pervasive task of computing semantic relatedness for English and German by means of correlation with human rankings and solving word choice problems. For the first time, we apply a concept vector based measure to a set of d...
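As a rough illustration of the kind of concept-vector measure mentioned in that abstract, the sketch below scores relatedness as the cosine of vectors over glosses; the toy gloss data, the tokenization, and the weighting are illustrative assumptions, not the paper's actual setup.

import math
from collections import Counter

glosses = {                       # concept id -> gloss text (toy stand-in for Wiktionary glosses)
    "car":   "a road vehicle powered by an engine",
    "bus":   "a large road vehicle carrying many passengers",
    "apple": "a round fruit with firm flesh",
}

def concept_vector(word):
    # how strongly the word is associated with each gloss/concept
    return Counter({cid: text.split().count(word)
                    for cid, text in glosses.items()
                    if word in text.split()})

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(concept_vector("road"), concept_vector("vehicle")))  # high: related words
print(cosine(concept_vector("road"), concept_vector("fruit")))    # 0.0: unrelated words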
Accessing and Standardizing Wiktionary Lexical Entries for Supporting the Translation of Labels in Taxonomies for Digital Humanities
We describe the usefulness of Wiktionary, the freely available web-based lexical resource, in providing multilingual extensions to catalogues that serve content-based indexing of folktales and related narratives. We develop conversion tools between Wiktionary and TEI, using ISO standards (LMF, MAF), to make such resources available to both the Digital Humanities community and the Language Resou...
Integrating WordNet and Wiktionary with lemon
Nowadays, there is a significant quantity of linguistic data available on the Web. However, linguistic resources are often published in proprietary formats and, as such, can be difficult to interface with one another, so they end up confined in “data silos”. The creation of web standards for publishing data on the Web and projects to create Linked Data have led to interest in the ...
DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF
Contributive resources, such as Wikipedia, have proved valuable for Natural Language Processing and multilingual Information Retrieval applications. This work focuses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia foundation. In this article, we present our extraction of multilingual lexical data from Wiktionary and provide it to the community as a M...
Grapheme-to-Phoneme Models for (Almost) Any Language
Grapheme-to-phoneme (g2p) models are rarely available in low-resource languages, as the creation of training and evaluation data is expensive and time-consuming. We use Wiktionary to obtain more than 650k word-pronunciation pairs in more than 500 languages. We then develop phoneme and language distance metrics based on phonological and linguistic knowledge; applying those, we adapt g2p models f...
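A minimal sketch of the kind of distance metrics that abstract mentions: phonemes compared through phonological feature sets and languages compared through phoneme inventories, with the nearest high-resource language chosen for adaptation. The feature assignments, inventories, and names below are toy illustrative assumptions, not the paper's actual data or metrics.

PHON_FEATURES = {                 # phoneme -> phonological features (toy assignments)
    "p": {"bilabial", "stop", "voiceless"},
    "b": {"bilabial", "stop", "voiced"},
    "s": {"alveolar", "fricative", "voiceless"},
    "z": {"alveolar", "fricative", "voiced"},
}

def phoneme_distance(a, b):
    # Jaccard distance over phonological feature sets
    fa, fb = PHON_FEATURES[a], PHON_FEATURES[b]
    return 1.0 - len(fa & fb) / len(fa | fb)

def language_distance(inventory_a, inventory_b):
    # Jaccard distance over the phoneme inventories of two languages
    return 1.0 - len(inventory_a & inventory_b) / len(inventory_a | inventory_b)

# choose the closest high-resource language whose g2p model could be adapted
inventories = {"high_resource_1": {"p", "b", "s"},
               "high_resource_2": {"p", "s"}}
target_inventory = {"p", "b", "z"}
closest = min(inventories,
              key=lambda lang: language_distance(inventories[lang], target_inventory))
print(phoneme_distance("p", "b"))  # 0.5: the two phonemes differ only in voicing
print(closest)                     # "high_resource_1"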